WORLD INTELLECTUAL PROPERTY ORGANIZATION
International Bureau




PCT

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)



(51) International Patent Classification 6:
G01N 27/447

A1

(11) International Publication Number: WO 97/02488

(43) International Publication Date: 23 January 1997 (23.01.97)

(21) International Application Number: PCT/US96/11130

(22) International Filing Date: 28 June 1996 (28.06.96)

(30) Priority Data:
08/497,202    30 June 1995 (30.06.95)    US



(60) Parent Application or Grant
(63) Related by Continuation
US    08/497,202 (CON)
Filed on    30 June 1995 (30.06.95)



(71) Applicant (for all designated States except US): VISIBLE GENETICS
INC. [CA/CA]; 700 Bay Street, Toronto, Ontario
M5G 1Z6 (CA).

(72) Inventors; and

(75) Inventors/Applicants (for US only): GREEN, Ronald, J.
[US/CA]; 251 Queens Quay #705, Toronto, Ontario (CA).
CHI, Vrijmoed [CA/CA]; 3125 Fifth Line West, Mississauga,
Ontario (CA). GILCHRIST, Rodney, D. [CA/CA];
1114 Edgehill Place, Oakville, Ontario (CA). DEE, Gregory
[CA/CA]; 40 McPherson Place, Toronto, Ontario (CA).
STEVENS, John, K. [US/CA]; 540 Huron Street, Toronto,
Ontario (CA).



(74) Agents: LARSON, Marina, T. et al.; Oppedahl & Larson, 1992
Commerce Street #309, Yorktown Heights, NY 10598-4412
(US).



(81) Designated States: AL, AM, AT, AU, AZ, BB, BG, BR, BY,
CA, CH, CN, CZ, DE, DK, EE, ES, FI, GB, GE, HU, IL,
IS, JP, KE, KG, KP, KR, KZ, LK, LR, LS, LT, LU, LV,
MD, MG, MK, MN, MW, MX, NO, NZ, PL, PT, RO, RU,
SD, SE, SG, SI, SK, TJ, TM, TR, TT, UA, UG, US, UZ,
VN, ARIPO patent (KE, LS, MW, SD, SZ, UG), Eurasian
patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European
patent (AT, BE, CH, DE, DK, ES, FI, FR, GB, GR, IE, IT,
LU, MC, NL, PT, SE), OAPI patent (BF, BJ, CF, CG, CI,
CM, GA, GN, ML, MR, NE, SN, TD, TG).



Published 

With international search report.



(54) Title: METHOD AND SYSTEM FOR DNA SEQUENCE DETERMINATION AND MUTATION DETECTION
(57) Abstract 

Normalization of experimental fragment patterns for nucleic acid polymers having putatively known sequences starts with obtaining
at least one raw fragment pattern for the experimental sample. The raw fragment pattern represents the positions of a selected nucleic acid
base within the polymer as a function of migration time or distance. This raw fragment pattern is conditioned using conventional baseline
correction and noise reduction techniques to yield a clean fragment pattern. The clean fragment pattern is then evaluated to determine one or
more "normalization coefficients". These normalization coefficients reflect the displacement, stretching or shrinking, and rate of stretching
or shrinking of the clean fragment pattern, or segments thereof, which are necessary to obtain a suitably high degree of correlation between the
clean fragment pattern and a standard fragment pattern which represents the positions of the selected nucleic acid base within a standard
polymer actually having the known sequence as a function of migration time or distance. The normalization coefficients are then applied
to the clean fragment pattern to produce a normalized fragment pattern which is used for base-calling in a conventional manner. This
method may be implemented in an apparatus comprising a computer processor programmed to determine normalization coefficients for an
experimental fragment pattern. This computer may be separate from the electrophoresis apparatus, or part of an integrated unit.






METHOD AND SYSTEM FOR DNA SEQUENCE 
DETERMINATION AND MUTATION DETECTION 

DESCRIPTION 

I. BACKGROUND OF THE INVENTION

This invention relates to a method and system of nucleotide sequence determination 
and mutation detection in a subject nucleic acid molecule for use with automated 
electrophoresis detection apparatus. 

One of the steps in nucleotide sequence determination of a subject nucleic acid polymer 
is interpretation of the pattern of oligonucleotide fragments which results from electrophoretic 
separation of fragments of the subject nucleic acid polymer (the "fragment pattern"). The 
interpretation of the fragment pattern, colloquially known as "base-calling," results in 
determination of the order of four nucleotide bases, A (adenine), C (cytosine), G (guanine) and 
T (thymine) for DNA or U (uracil) for RNA in the subject nucleic acid polymer. 

In the earliest method of base-calling, a method which is still commonly employed, the
subject nucleic acid polymer is labeled with a radioactive isotope and either Maxam and Gilbert
chemical sequencing (Proc. Natl. Acad. Sci. USA 74: 560-564 (1977)) or Sanger et al. chain
termination sequencing (Proc. Natl. Acad. Sci. USA 74: 5463-5467 (1977)) is performed. The
resulting four samples of nucleic acid fragments (terminating in A, C, G, or T(U) respectively
in the Sanger et al. method) are loaded into separate loading sites at the top end of an
electrophoresis gel. An electric field is applied across the gel, and the fragments migrate
through the gel. During this electrophoresis, the gel acts as a separation matrix. The fragments,
which in each sample are of an extended series of discrete sizes, separate into bands of
discrete species in a channel along the length of the gel. Shorter fragments generally move
more quickly than larger fragments. After a suitable separation period, the electrophoresis is
stopped. The gel may now be exposed to radiation sensitive film for the generation of an
autoradiograph. The pattern of radiation detected on the autoradiograph is a fixed
representation of the fragment pattern. A researcher then manually base-calls the order of
fragments from the fragment pattern by identifying the stepwise sequence of the order of bands
across the four channels.





More recently, with the advent of the Human Genome Organization and its massive
project to sequence the entire human genome, researchers have been turning to automated
DNA sequencers to process vast amounts of DNA sequence information. Existing automated
DNA sequencers are available from Applied Biosystems, Inc. (Foster City, CA), Pharmacia
Biotech, Inc. (Piscataway, NJ), Li-Cor, Inc. (Lincoln, NE), Molecular Dynamics, Inc.
(Sunnyvale, CA) and Visible Genetics Inc. (Toronto). Automated DNA sequencers are
basically electrophoresis apparatuses with detection systems which detect the presence of a
detectable molecule as it passes through a detection zone. Each of these apparatuses is
therefore capable of real-time detection of migrating bands of oligonucleotide fragments; the
fragment patterns consist of a time-based record of fluorescence emissions or other detectable
signals from each individual electrophoresis channel. They do not require the cumbersome
autoradiography methods of the earliest technologies to generate a fragment pattern.

The prior art techniques for computer-assisted base-calling for use in automated DNA
sequencers are exemplified by the method of the Pharmacia A.L.F.™ sequencer.
Oligonucleotide fragments are labeled with a fluorescent molecule such as fluorescein prior to
the sequencing reactions. Sanger et al. sequencing is performed and samples are loaded into
the top end of an electrophoresis gel. Under electrophoresis the bands of species separate, and
a laser at the bottom end of the gel causes the fragments to fluoresce as they pass through a
detection zone. The fragment patterns are a record of fluorescence emissions from each
channel. In general each fragment pattern includes a series of sharp peaks and low, flat plains,
the peaks representing the passage of a band of oligonucleotide fragments and the plains
representing the absence of such bands.

To perform computer-assisted base-calling, the A.L.F. system executes at least four
discrete functions: 1) it smooths the raw data with a band-pass frequency filter; 2) it
identifies successive maxima in each data stream; 3) it aligns the smoothed data from each of
the four channels into an aligned data stream; and 4) it determines the order of the successive
maxima with respect to the aligned data stream. The alignment process used in the apparatus
depends on the existence of very little variability between the lanes of the gel. In this case, the
fragment patterns from each lane can be superimposed by alignment to a presumed starting
point in each pattern to provide a record of a continuous, non-overlapping series of sharp
peaks, each peak representing a one nucleotide step in the subject nucleic acid. Where a
distinct ordering of peaks cannot be made, the computer identifies the presence of ambiguities
and fails to identify a sequence.

Other published methods of computer-assisted base-calling include the methods
disclosed by Tibbetts and Bowling (US Pat. No. 5,365,455) and Dam et al. (US Pat. No.
5,119,316), which patents are incorporated herein by reference. Tibbetts and Bowling disclose
a method and system which relies on the second derivative of the peak slopes to smooth the
data. The second derivative is used to provide an informative variable and an intensity variable
to determine the nucleic acid sequence corresponding to the subject nucleic acid polymer.
Dam et al. disclose a method of combining peak shapes from two signal spectrums derived
from the same electrophoresis channel to determine the order of nucleotides in the subject
nucleic acid polymer.
Three practical problems face all existing methods and systems of base-calling. The
first is the inability to align shifted lanes of data. If the signal from the related data streams
does not begin at approximately the same time, it is difficult, if not impossible, for these
techniques to determine the correct alignment. Secondly, it is a challenge to resolve
"compressions" in the fragment pattern: those anomalies wherein the signals from two or more
nucleotides in a row are not distinguishably separated as compared to other nucleotides in the
general vicinity. Compressions result most often from short hairpin loops at the end of a
fragment which cause altered gel mobility features. The third problem is the inability to
identify nucleotide sequences beyond the limits of single nucleotide resolution. Larger
fragments tend to need longer electrophoresis runs to separate into discrete bands of
fragments, in part because a one nucleotide addition to a 300 nt fragment is less significant than
a one nucleotide addition to a 25 nt fragment. The limit of resolution is reached when
individual bands cannot be usefully distinguished.

All of these problems limit the most crucial aspects of base-calling, which are speed,
read-length and accuracy. Read-length is the number of fragment bands which can be
identified from the fragment pattern. Greater read-length provides greater information about
the DNA sequence in question. Accuracy measures the number of base-calling errors.
Frequent errors are unacceptable since they alter the biological meaning of the DNA sequence
in question. And, as described below, if DNA sequence determination is to be used as a tool
for diagnostic purposes, base-calling errors can lead to misdiagnosis.




The advent of DNA sequence-based diagnosis provides new opportunities for improved
speed, accuracy and read-length in computer-assisted base-calling. DNA sequence-based
diagnosis is the routine sequencing of patient DNA to identify genotype and/or specific gene
sequences of the patient, wherein the DNA sequence is reported back to the physician and
patient in order to assist in diagnosis and treatment of patient conditions. One of the great
advantages of DNA sequence-based diagnosis is that the DNA sequence being examined is
largely known. As demonstrated by the instant invention, it is possible to use the known
fragment pattern for each DNA sequence to assist in the interpretation of the fragment pattern
obtained from a patient sample to obtain improved read-length and accuracy. It can also be
used to increase the speed of sample analysis.

It is an object of the instant invention to provide a method and system for nucleotide 
sequence determination and mutation detection which can be used with DNA sequence-based 
diagnosis. 

It is a further object of the instant invention to provide a method and system for
nucleotide sequence determination and mutation detection when the fragment pattern
demonstrates localized compressions.

It is a further object of the instant invention to provide a method and system for
nucleotide sequence determination and mutation detection when the fragment pattern does not
provide single nucleotide resolution.

It is a further object of the instant invention to provide a method and system of
computer-assisted base-calling which can be used with fragment pattern records from high
speed electrophoretic separations which demonstrate less than ideal separation characteristics.

II. SUMMARY OF THE INVENTION

These and other objects of the invention are realized by the application of a novel
approach to the normalization of experimental fragment patterns for nucleic acid polymers
having putatively known sequences. In this method, at least one raw fragment pattern is
obtained for the experimental sample. The raw fragment pattern represents the positions of
a selected nucleic acid base within the polymer as a function of migration time or distance.
This raw fragment pattern is conditioned using conventional baseline correction and noise
reduction techniques to yield a clean fragment pattern. The clean fragment pattern is then
evaluated to determine one or more "normalization coefficients." These normalization
coefficients reflect the displacement, stretching or shrinking, and rate of stretching or shrinking
of the clean fragment pattern, or segments thereof, which are necessary to obtain a suitably high
degree of correlation between the clean fragment pattern and a standard fragment pattern
which represents the positions of the selected nucleic acid base within a standard polymer
actually having the known sequence as a function of migration time or distance. The
normalization coefficients are then applied to the clean fragment pattern to produce a normalized
fragment pattern which is used for base-calling in a conventional manner.

In applying the present invention to the evaluation of nucleic acid polymers of
putatively known sequence for the detection of well-characterized mutations in which one base
is substituted for another at a constant site in the gene, it will generally be sufficient to
determine normalization coefficients for a single fragment pattern reflecting the positions of
either the normal or mutant base within the nucleic acid polymer. For more general
applications, however, it is desirable to determine separate normalization coefficients for each of the
four oligonucleotide fragment patterns obtained for the sample by correlating them with four
standard fragment patterns.

The method of the invention is advantageously implemented in an apparatus comprising
a computer processor programmed to determine normalization coefficients for an experimental
fragment pattern. This computer may be separate from the electrophoresis apparatus, or part
of an integrated unit.

III. BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 illustrates the effect of background subtraction and band-pass frequency
filtration on the appearance of data.

Figs. 2A, 2B and 2C illustrate the correlation method of the instant invention.

Figs. 3A, 3B and 3C illustrate the effect of increasing the number of segments into
which the sample data is divided.

Fig. 4 is a plot of preferred correlation shift against data point number.

Figs. 5A and 5B illustrate alignment of data windows.

Fig. 6 shows the process of "reproduction" using Genetic Algorithms.





Fig. 7 shows a binary genotype useful for finding values for the coefficients of a
second-order polynomial using Genetic Algorithms.

Fig. 8 illustrates the exercise of base-calling of aligned data, as obtained from a
Pharmacia A.L.F.™ and processed using HELIOS™ software.

Figs. 9A and 9B illustrate a sequencing compression.

Fig. 10 illustrates a cross correlogram which plots maximum correlation against data
point number of shifted Origin across the entire length of sample and standard fragment
patterns.

Fig. 11 shows an apparatus in accordance with the invention.

IV. DETAILED DESCRIPTION OF THE INVENTION 

The instant invention is designed to work with DNA sequence-based diagnosis or any
other sequencing environment involving nucleotide sequence determination and/or mutation
detection for the same region of DNA in a plurality of individual DNA-containing samples
(human or otherwise). This "diagnostic environment" is unlike the vast majority of DNA
sequence determination now occurring, in which researchers are attempting to make an initial
determination of the nucleotide sequence of unknown regions of DNA. DNA sequence-based
diagnosis in which the DNA sequence of a patient gene is determined is one example of a
technique performed within a diagnostic environment to which the present invention is
applicable. Other examples include identification of pathogenic bacteria or viruses, DNA
fingerprinting, plant and animal identification, etc.

The present invention provides a method for normalization of experimental fragment
patterns for nucleic acid polymers with putatively known sequences which enhances the ability
to interpret the information found in the fragment patterns. In this method, at least one raw
fragment pattern is obtained for the experimental sample. As used in the specification and
claims hereof, the term "raw fragment pattern" refers to a data set representing the positions
of one selected nucleic acid base within the experimental polymer as a function of migration
time or distance. Preferred raw fragment patterns which may be processed using the present
invention include raw data collected using the fluorescence detection apparatus of automated
DNA sequencers. However, the present invention is applicable to any data set which reflects
the separation of oligonucleotide fragments in space or time, including real-time fragment
patterns using any type of detector, for example a polarization detector as described in US
Patent Application No. 08/387,272 filed February 13, 1995 and incorporated herein by
reference; densitometer traces of autoradiographs or stained gels; traces from laser-scanned
gels containing fluorescently-tagged oligonucleotides; and fragment patterns from samples
separated by mass spectrometry.

This raw fragment pattern is conditioned, for example using conventional baseline
correction and noise reduction techniques, to yield a "clean fragment pattern." As is known in
the art, three methods of signal processing commonly used are background subtraction, low
frequency filtration and high frequency filtration.

Background subtraction eliminates the minimum constant noise recorded by the 
detector. The background is calculated as a measure of the minimum signal obtained over a 
selected number of data points. This measure differs from low frequency filtration which 
eliminates low period variations in signal that may result from variable laser intensity, etc. 

High frequency filtration eliminates the small variations in signal intensity that occur
over highly localized areas of signal. The result after base-line subtraction is a band-pass filter
applied to the frequency domain:

[equation illegible in source]

where one filter parameter determines the low-frequency cutoff and the other determines the
high-frequency cutoff, respectively. Fig. 1 illustrates the effect of background subtraction, low and high frequency
filtration on the appearance of data from a Visible Genetics MicroGene Blaster™, resulting
in a clean fragment pattern useful in the invention.
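
The conditioning steps described above may be sketched as follows. This is a minimal illustration only, not the filter used in the MicroGene Blaster™: the function names, the use of NumPy, the choice of the minimum over a selected number of leading data points as the background, and the hard (brick-wall) cutoffs are all assumptions made for the example.

```python
import numpy as np

def subtract_background(signal, n_points=None):
    """Subtract the minimum signal found over the first n_points data points.

    Assumption: the background is taken as a single constant; n_points=None
    means the whole record is used to find the minimum.
    """
    signal = np.asarray(signal, dtype=float)
    selected = signal if n_points is None else signal[:n_points]
    return signal - selected.min()

def band_pass(signal, low_cut, high_cut, sample_rate=1.0):
    """Keep only Fourier components with low_cut <= frequency <= high_cut.

    Assumption: hard cutoffs in the frequency domain; a real instrument may
    use a smoother filter shape.
    """
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    keep = (freqs >= low_cut) & (freqs <= high_cut)
    spectrum[~keep] = 0.0          # discard drift (low) and noise (high)
    return np.fft.irfft(spectrum, n=len(signal))

# Hypothetical raw trace: offset + slow drift + peaks + fast noise.
t = np.arange(1000)
peaks = np.sin(2 * np.pi * 0.05 * t)
raw = 5.0 + 2.0 * np.sin(2 * np.pi * 0.001 * t) + peaks + 0.3 * np.sin(2 * np.pi * 0.4 * t)
clean = band_pass(subtract_background(raw), low_cut=0.01, high_cut=0.1)
```

In this synthetic trace the slow drift stands in for variable laser intensity and the fast component for localized noise; only the mid-frequency component representing the bands survives the filter.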

In accordance with the present invention, a "clean fragment pattern" may be obtained
by the application of these signal-processing techniques singly or in any combination. In
addition, other signal processing techniques may be employed to obtain comparable clean
fragment patterns without departing from the present invention.

One note of caution concerning this conditioning step is the finding that signal
conditioning or pre-processing may delete features of consequence in the preparation of the
clean fragment pattern. It is possible to include a feedback mechanism in the system which
adjusts the parameters of the filter mechanisms, based on the analysis of the degree of
correlation, described below. The feedback mechanism adjusts the types of filters employed
in signal processing to provide the maximum information about the subject nucleic acid
sequence.

The next step in the method of the present invention is the comparison of the clean
fragment pattern with a standard fragment pattern to determine one or more "normalization
coefficients." The use of a "standard fragment pattern" takes advantage of the fact that in a
diagnostic environment, there is a known fragment pattern that is expected from each test
sample. As used in the specification and claims of this application, the term "standard fragment
pattern" refers to a typical fragment pattern which results from sequencing a particular known
region of DNA using the same technique as the experimental technique being employed. Thus,
a standard fragment pattern may be a time-based fluorescence emission record as obtained
from an automated DNA sequencer, or it may be another representation of the separated
fragment pattern.

A standard fragment pattern used in the present invention includes all the less-than-ideal
characteristics of nucleotide separation that may be associated with sequencing of any
particular region of DNA. A standard fragment pattern may also tend to be idiosyncratic with
the electrophoresis apparatus employed, the reaction conditions employed in sequencing and
other factors. Fig. 2B illustrates a standard fragment pattern for the T lane of the first 260
nucleotides from the universal primer of pUC18 prepared using Sequenase 2.0 (United States
Biochemical, Cleveland) and detected on a Visible Genetics MicroGene Blaster™. Four
standard fragment patterns, one for each nucleotide, make up the standard fragment pattern
set for a particular nucleic acid polymer.

A standard fragment pattern or fragment pattern set for a particular nucleic acid
polymer may be generated by various methods. One such method is to obtain several to
several hundred actual fragment patterns for the DNA sequence in question from samples
wherein the DNA sequence is already known. From these trial runs, a human operator may
select the trial run that is found to be the most typical fragment pattern. Because of slight gel
or sample anomalies, and other anomalies, different fragment patterns may have slightly
different separation characteristics, slightly different peak amplitudes, etc. The selected
pattern generally should not show discrete peaks in an area where compressions and overlaps
are regularly found. Similarly, the selected pattern generally must not show discrete separation
of bases beyond the average single nucleotide resolution limit of the electrophoresis instrument
used. An alternative method to select a standard fragment pattern is to generate a
mathematically averaged result from a combination of the trial runs. As described
below, the main use of the standard fragment pattern is as a basis for modifying and
normalizing an experimental fragment pattern to enhance the reliability of the interpretation of
the experimental data. Thus, the standard fragment pattern is not used as a comparator for
identifying deviations from the expected or "normal" sequence, and in fact is used in a manner
which assumes that the experimental sequence will conform to the expected sequence.

A feature of the standard fragment pattern which is important for some uses is that it 
results in a minimum of (and preferably no) ambiguities in base-calling when combined with 
the standard fragment patterns from the three other sequencing channels. The human operator 
may prefer to empirically determine which fragment patterns from which lanes work best 
together in order to determine the standard fragment patterns for each sequencing lane. 

Additionally, it is well known in the art that a range of alleles for any gene may be
present in a population. To be most useful, a standard fragment pattern should result from
sequencing the dominant allele of a given population. Because of this, for some applications
of the invention, multiple standard fragment patterns may exist for a specific gene, even within
a single experimental environment.

A standard fragment pattern may be used in different ways to provide improved
read-length, accuracy and speed of sample analysis. These improvements rely on comparison
of an experimental sample fragment pattern with the standard fragment pattern to determine
one or more "normalization coefficients" for the particular experimental fragment pattern.

The normalization coefficients reflect the displacement, stretching or shrinking, and rate
of stretching or shrinking of the clean fragment pattern, or segments thereof, which are
necessary to obtain a suitably high degree of correlation between the clean fragment pattern
and a standard fragment pattern which represents the positions of the selected nucleic acid base
within a standard polymer actually having the known sequence as a function of migration time
or distance. The normalization coefficients are then applied to the clean fragment pattern to
produce a normalized fragment pattern which is used for base-calling in a conventional manner.
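
The application of normalization coefficients to a clean fragment pattern may be pictured with the following sketch. It is an illustration under stated assumptions rather than the implementation of the invention: the function name `apply_normalization`, the restriction to a displacement (`shift`) and a uniform stretch factor (`stretch`), and the use of linear interpolation are all hypothetical choices, and the rate of stretching or shrinking (a higher-order term) is omitted for brevity.

```python
import numpy as np

def apply_normalization(clean, shift=0.0, stretch=1.0):
    """Resample a clean fragment pattern so that the data point at index j
    moves to index stretch * j + shift.

    Assumptions: linear interpolation between data points, and zero signal
    wherever the normalized pattern has no corresponding source data.
    """
    clean = np.asarray(clean, dtype=float)
    idx = np.arange(len(clean))
    # For each output index, find where it came from in the clean pattern.
    source_pos = (idx - shift) / stretch
    return np.interp(source_pos, idx, clean, left=0.0, right=0.0)

# Hypothetical pattern with a single band at data point 10.
pattern = np.zeros(50)
pattern[10] = 1.0
shifted = apply_normalization(pattern, shift=5)        # band moves to 15
stretched = apply_normalization(pattern, stretch=2.0)  # band moves to 20
```

A stretch factor greater than one spreads the pattern out along the time axis, and a factor below one shrinks it.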

The process of comparing the clean fragment pattern and the standard fragment pattern
to arrive at normalization coefficients can be carried out in any number of ways without
departing from the present invention. In general, suitable processes involve consideration of
a number of trial normalizations, and selection of the trial normalization which achieves the
best fit in the model being employed. Several non-limiting examples of useful comparison
procedures are set forth below. The procedures result in the development of normalization
coefficients which, when applied to an experimental fragment pattern, shift, stretch or shrink
the experimental fragment pattern to achieve a high degree of overlap with the standard
fragment pattern.

It will be understood that the theoretical goal of achieving an exact overlap between
an experimental fragment pattern and a standard fragment pattern may not be realistically
achievable in practice, nor are repetitive and time consuming calculations to obtain perfect
normalization necessary to the successful use of the invention. Thus, the term "high degree
of normalization" refers to the maximization of the normalization which is achievable within
practical constraints. As a general rule, a point-for-point correlation coefficient calculated for
normalized fragment patterns and the corresponding standard fragment pattern of at least 0.8
is desirable, while a correlation coefficient of at least 0.95 is preferred.
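
The point-for-point correlation coefficient referred to above may be computed, for example, as a Pearson correlation over corresponding data points. The sketch below assumes that interpretation; the function names and the label "insufficient" for values below 0.8 are hypothetical and are introduced only for illustration.

```python
import numpy as np

def correlation_coefficient(normalized, standard):
    """Pearson correlation between two patterns, taken point for point.

    Assumption: both patterns have the same number of data points.
    """
    normalized = np.asarray(normalized, dtype=float)
    standard = np.asarray(standard, dtype=float)
    if normalized.shape != standard.shape:
        raise ValueError("patterns must contain the same number of data points")
    return float(np.corrcoef(normalized, standard)[0, 1])

def grade(r):
    """Classify a coefficient against the 0.8 and 0.95 thresholds in the text."""
    if r >= 0.95:
        return "preferred"
    if r >= 0.8:
        return "desirable"
    return "insufficient"  # hypothetical label; the text names no such class

# A pattern that differs from the standard only in gain and offset
# correlates perfectly, since the coefficient ignores both.
standard = np.array([0.0, 1.0, 4.0, 1.0, 0.0, 3.0, 0.0])
print(grade(correlation_coefficient(2.0 * standard + 1.0, standard)))
```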

Fig. 2 illustrates one correlation method of the instant invention. Fig. 2A illustrates a clean
fragment pattern obtained using a Visible Genetics MicroGene Blaster™. The signal records
the T lane of a pUC18 sequencing run over the first 260 nucleotides (nt) of the subject nucleic
acid molecule. The Y axis is an arbitrary representation of signal intensity; the X axis
represents a time of 0 to 5 minutes. In the sequencing run shown, the peaks are cleanly
separated.

Fig. 2B represents the standard fragment pattern for the T lane of the first 260
nucleotides from the universal primer of pUC18 prepared using Sequenase 2.0 (United States
Biochemical, Cleveland) and detected on a Visible Genetics MicroGene Blaster™. The
standard sequence was selected by a human operator as the most typical fragment pattern from
25 trial runs.

The experimental fragment pattern of Fig. 2A may be compared with the standard
fragment pattern of Fig. 2B according to the equation:

$$f(x) \circ g(x) = \sum_{m=0}^{M-1} f^{*}(m)\, g(x+m)$$

where f(x) and g(x) are two discrete functions, x = 0, 1, 2, ..., M-1, f* is the complex conjugate
of f, and M is one less than the sum of the data points in f(x) and g(x). Alternatively, the equation
may be described as

COR(l1, l2)

in the NextStep™ programming environment, where l1 = the experimental fragment pattern, and
l2 = the standard fragment pattern.

Fig. 2C shows the correlation values of the entire window of Lane A against the entire 
window of Lane B as lane A is translated relative to lane B. (As the window is shifted, it 
effectively wraps around, such that the End and Origin points appear to be side by side). The 
result shows maximum correlation at point P which corresponds to a preferred correlation shift 
of +40 data points. 
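
The search for the preferred correlation shift over a whole window, with wrap-around as described for Fig. 2C, may be sketched as a circular cross-correlation. The example below is a hedged illustration: it assumes patterns of equal length and evaluates the correlation with NumPy's FFT routines, which is one convenient way to do so and is not stated in the text; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def best_circular_shift(experimental, standard):
    """Return (shift, correlation values) for two equal-length patterns.

    corr[k] = sum over m of f[m] * g[(m + k) mod N], so a result of +40
    means the experimental pattern must be displaced by +40 data points
    to line up with the standard pattern.
    """
    f = np.asarray(experimental, dtype=float)
    g = np.asarray(standard, dtype=float)
    if f.shape != g.shape:
        raise ValueError("patterns must contain the same number of data points")
    # Correlation theorem: the transform of the circular cross-correlation
    # is conj(F) * G.
    corr = np.fft.ifft(np.conj(np.fft.fft(f)) * np.fft.fft(g)).real
    k = int(np.argmax(corr))
    if k > len(f) // 2:        # report large wrap-around lags as negative
        k -= len(f)
    return k, corr

# Hypothetical data: the standard is the experimental trace delayed by 40 points.
rng = np.random.default_rng(0)
experimental = rng.random(512)
standard = np.roll(experimental, 40)
shift, corr = best_circular_shift(experimental, standard)
```

Plotting `corr` against the lag gives a curve of the kind shown in Fig. 2C, with its maximum at the preferred correlation shift.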

Fig. 2 illustrates comparison of a complete experimental fragment pattern and a 
complete standard fragment pattern. In this case, the only normalization coefficient determined 
is the shift which results in the highest level of correlation. This simple model however, lacks 
the robustness which is needed for general applicability. Thus for most purposes, a more 
complex analysis is required to obtain good normalization. 

One way to take in to account the experimental variability in migration rate caused by 
inconsistency of sample preparation chemistry, sample loading, gel material, gel thickness, 
electric field density, clamping/securing of gel in instrument, detection rate and other aspects 
of the electrophoresis process Is to assign the data points of the clean fragment pattern to one 
or more segments or "windows." Each window includes an empirically determined number of 
data points, generally in the range of 100 to lOOOO data points. Windows may be of variable 
size within a given data series, if desired. The starting data point of each window is designated 
Origin; the final data point in a window is designated End, Each window of the experimental 



COR(ll,21) 




wo 97/02488 PCT/US9MlldO 

- 12- 

fragment pattern is then compared with a comparable number of data points making up the 
standard fragment using the same procedure described above. 
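
The wrap-around correlation and the per-window comparison described above can be illustrated with a minimal Python sketch. It assumes real-valued data (so the complex conjugate is a no-op), equal-length patterns, and equal-sized windows compared against the same positions in the standard; the function names are hypothetical and are not taken from the specification.

```python
def circular_correlation(f, g):
    """COR[s] = sum over m of f[m] * g[(m + s) mod n], for real-valued data."""
    n = len(f)
    if len(g) != n:
        raise ValueError("patterns must contain the same number of data points")
    return [sum(f[m] * g[(m + s) % n] for m in range(n)) for s in range(n)]


def preferred_shift(experimental, standard):
    """Shift, in data points, at which the wrap-around correlation peaks."""
    n = len(experimental)
    cor = circular_correlation(experimental, standard)
    best = max(range(n), key=lambda s: cor[s])
    # Shifts past the midpoint are reported as negative (wrap-around).
    return best - n if best > n // 2 else best


def window_shifts(experimental, standard, n_windows):
    """Preferred shift for each equal-sized window, as (Origin, shift) pairs."""
    size = len(experimental) // n_windows
    return [
        (origin, preferred_shift(experimental[origin:origin + size],
                                 standard[origin:origin + size]))
        for origin in range(0, size * n_windows, size)
    ]
```

This direct summation costs on the order of n squared operations per window; it is meant only to make the definition concrete.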

Figs. 3A and 3B illustrate the effect of increasing the number of segments or windows
into which the experimental data is divided. In Fig. 3A, the experimental fragment pattern
from Fig. 2A was divided into three windows, and each was evaluated individually. Instead
of the single offset of +40 data points found using a single window, the use of three windows
results in an increasing degree of shift throughout the run, i.e., +24, +34 and +50 in the
successive windows reading from right to left. Fig. 3B shows the use of five windows on the
same experimental fragment, and results in even clearer resolution, with successive shifts of
+16, +23, +35, +48, and +51 for the windows. Simply put, the consequence of too few
windows is a lack of precision in shifting information. This may cause problems in base-calling
aligned data. It is therefore desirable to use more than one window in the correlation process.

When more than one window is used in the analysis as illustrated in Fig. 3, it may
become necessary to stretch or shrink some windows to obtain a continuous stream of data in
the correlated data and to obtain a sufficiently high correlation. To calculate stretch or shrink
("elasticity") for a window where a plurality of windows are defined, one can use an "elasticity
plot" of preferred correlation shift against data point number (Fig. 4). An increased number
of windows increases the number of points on the elasticity plot, allowing more accurate
determination of the slope and offset of the alignment line.

It has been found experimentally that in an elasticity plot, a linear equation representing
the least mean square fit adequately represents the data. In this case the linear equation

f(x) = mx + b

will satisfy the line, where m is the slope of the line and b is the Y intercept ("offset") expressed
in number of data points. The equation of the line is used to shift the sample data, where the
value at shifted(i) of the shifted data is given by:

shifted(i) = sample(i) + ((sample(i) * m) + b)

Note that as illustrated in Fig. 5, when the peaks identified in the sample window (Lane
B) do not align with the standard data (Lane A) (Fig. 5A), they may be aligned for analysis
purposes by padding the elastically shifted data with zeros when the formula produces values
outside of the sample data's range (Fig. 5B).
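
A minimal Python sketch of the elasticity calculation follows. It reads sample(i) in the formula above as the position of the i-th data point, fits the line by ordinary least squares, and resamples by inverting the position mapping so that stretched data contain no gaps; those readings and the function names are assumptions made for illustration only.

```python
def fit_elasticity_line(points, shifts):
    """Least-squares slope m and offset b of preferred shift vs. data point number."""
    n = len(points)
    mean_x = sum(points) / n
    mean_y = sum(shifts) / n
    sxx = sum((x - mean_x) ** 2 for x in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(points, shifts))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def apply_elastic_shift(sample, m, b):
    """Move the value at position i to position i + (i*m + b), padding with zeros.

    Assumes m != -1. Each output position j is filled from the nearest source
    position i that satisfies j = i + (i*m + b); positions whose source falls
    outside the sample data's range are set to zero.
    """
    n = len(sample)
    shifted = []
    for j in range(n):
        i = round((j - b) / (1.0 + m))
        shifted.append(sample[i] if 0 <= i < n else 0.0)
    return shifted
```

For example, fitting hypothetical window centres against the per-window shifts of Fig. 3B would yield the m and b passed to `apply_elastic_shift`.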

While the use of multiple windows increases the accuracy of the alignment, a potential
problem arises when too many windows are used. As illustrated in Fig. 4C, when windows
include too few features, the correlation between the data in the window and the standard
fragment pattern becomes meaningless. In Fig. 4C, window size has dropped below 1000 data
points. One window which includes a single peak is found to have highest correlation with a
peak distantly removed from the location where it would otherwise be expected to correlate.
This situation demonstrates that the human operator must be sensitive to the unique
circumstances of each standard fragment pattern to determine the optimum number of data
points per window.

One method wherein windows with fewer data points can be employed is to limit the 
amount of the standard fragment pattern against which the window is correlated. Again, such 
a limitation would be empirically determined as in the other data filters employed. It is found 
experimentally that correlation of a sample window with that region of the standard fragment 
pattern that falls approximately at the same number of data points from the start of signal, and 
includes twice as many data points as the sample window, is sufficient to obtain correlations 
which are not often spurious. 

An alternative approach to the determination of normalization coefficients, which is
applicable whether the experimental fragment pattern is considered in one or several segments,
makes use of an adaptive computational method known as a "Genetic Algorithm." See Holland,
J., Adaptation in Natural and Artificial Systems, The University of Michigan Press (1975).
Genetic Algorithms (GAs) are particularly good at solving optimization problems where
traditional methods may fail. In the context of normalization of nucleic acid fragment patterns,
GAs are particularly suited to use in experimental conditions where variations in velocity may
exceed 5%.

The conceptual basis of GAs is Darwinian "survival of the fittest." In nature,
individuals compete for resources (e.g., food, shelter, mates, etc.). Those individuals which
are most highly adapted for their environment tend to produce more offspring. GAs attempt
to mimic this process by "evolving" solutions to problems.

GAs operate on a "population" of individuals, each of which is a possible solution to
a given problem. Each individual in the starting population is assigned a unique binary string
which can be considered to represent that individual's "genotype." The decimal equivalent of
this binary genotype is referred to as the "phenotype." A fitness function operating on the
phenotype reflects how well a particular individual solves the problem.

Once the fitness of every individual in the starting population has been determined, a
new generation is created through reproduction. Individuals are selected for reproduction
from the starting population based on their fitness. The higher an individual solution's fitness,
the greater the probability of it contributing one or more offspring to the next generation.
During reproduction, the number of individuals in a population is kept constant through all
generations. "Reproduction" results from combining the genotypes of the individuals from the
prior generation with the highest fitness. Thus, as shown in Fig. 6, in a population of four
individuals, portions of the genotype of the two individuals with the highest fitness are
exchanged at a randomly selected cross-over point to yield a new generation of offspring.

A further aspect of the Genetic Algorithm approach is the random use of "mutations"
to introduce diversity into the population. Mutations are performed by flipping a single
randomly selected bit in an individual's genotype.

Arriving at a solution to a problem using GAs involves the repeated steps of fitness
evaluation, reproduction (through cross-over), and possibly mutation. Each step is simply
repeated in turn until the population converges within predetermined limits upon a single
solution to the given problem. A pseudo-code implementation of this process is shown in
Table 1.

Table 1: Genetic Algorithm in Pseudo-code

BEGIN
    generate initial population
    compute the fitness of each individual
    WHILE NOT finished DO
        FOR (population_size / 2) DO
            select two individuals from old generation for mating
            recombine the two individuals to create two offspring
            randomly mutate offspring
            insert offspring into new generation
        END
        IF population has converged THEN
            finished := TRUE
    END
END

Beasley et al., "An Overview of Genetic Algorithms: Part I, Fundamentals", University
Computing 15(2): 58-69 (1993).

It will be appreciated by persons skilled in the art that the task of aligning fragment
patterns needs to take into account a great many experimental variations, including variations
in sample preparation chemistry; sample loading; gel material; gel thickness; electric field
density; clamping/securing of gel in instrument; detection rate and other aspects of the
electrophoresis process. We have found experimentally that applying a second-order
polynomial

f(x) = c + bx + ax^2

where c reflects the linear shift, b reflects the stretch or shrink, and a reflects the rate at which
this stretch or shrink occurs, to a clean fragment pattern provides good normalization of the
clean fragment pattern with a standard fragment pattern for experimental data having variations
in velocity of up to 45%. Using GAs, the coefficients a, b and c can be readily optimized.

A suitable approach to this optimization uses a binary string as the genotype for each
individual, which is divided into three sections representing the three coefficients as shown in
Fig. 7. The size of each section is dependent on the range of possible values of each coefficient
and the resolution desired. The phenotype of the individual is determined by decoding each
section to the corresponding decimal value.

As shown in Fig. 7, a binary string for use in solving the problem presented by this
invention may contain 32 bits of which 8 bits specify the offset coefficient c, 13 bits specify the
relative velocity b, and 11 bits specify the relative acceleration a. The objective function used
to measure the fitness of an individual is the intersection of the standard fragment pattern and
an experimental fragment pattern produced by applying the second-order polynomial to the
experimental fragment pattern. The intersection is defined by the equation

$$f(x, y) = \sum_{i=1}^{n} \min(x_i, y_i)$$

where x is the experimental fragment pattern, y is the standard fragment pattern and n is the
number of data points. The intersection will be greatest when the two sequences are perfectly
aligned.
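
The 32-bit genotype and the intersection measure can be sketched in Python as follows. The order of the three sections within the string (c, then b, then a) is an assumption, since Fig. 7 is not reproduced here, and the decoded values are left as raw integers because the scaling from integer to coefficient is not stated; the function names are hypothetical.

```python
def decode_genotype(bits):
    """Split a 32-character bit string into integer phenotypes (c, b, a).

    Assumed layout: 8 bits for c, then 13 bits for b, then 11 bits for a.
    """
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("genotype must be a string of 32 binary digits")
    c = int(bits[0:8], 2)
    b = int(bits[8:21], 2)
    a = int(bits[21:32], 2)
    return c, b, a


def intersection(x, y):
    """Sum of the pointwise minimum of two fragment patterns."""
    return sum(min(xi, yi) for xi, yi in zip(x, y))
```

For non-negative signals, a pattern intersected with itself returns its own total signal, which is the largest value the measure can take against that pattern.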

Calculating the fitness of each individual is a three-step process. First, the individual's
genotype is decoded, producing the values (phenotypes) for the three coefficients. Second, the
coefficients are plugged into the second-order polynomial and the polynomial is used to modify
the clean fragment pattern. Third, the intersection of the modified fragment pattern and the
standard pattern is calculated. The intersection value is then assigned to the individual as its
fitness value. About 20 generations are needed to align the two sequences using a population
of 50 individuals with a mutation probability of 0.001 (i.e., 1 out of every 1000 bits mutated
after crossover). Using conventional computer equipment this can be accomplished in
approximately 8 seconds. This time period is sufficiently short that all calculations can be run
for a standardized period of time, rather than to a selected degree of convergence. This
substantially simplifies experimental design.
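
The loop of Table 1, run for a fixed number of generations with the parameters quoted above, might look like the following Python sketch. Fitness-proportional selection, single-point crossover and the function name `evolve` are illustrative assumptions; the fitness argument is any function returning a non-negative score for a bit string, and the sketch returns the fittest individual seen in any generation.

```python
import random


def evolve(fitness, n_bits=32, pop_size=50, generations=20, p_mut=0.001, seed=0):
    """Fixed-length genetic algorithm over bit strings; returns the best string seen."""
    rng = random.Random(seed)
    population = ["".join(rng.choice("01") for _ in range(n_bits))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Small constant keeps selection valid if every fitness is zero.
        weights = [fitness(ind) + 1e-9 for ind in population]
        new_generation = []
        while len(new_generation) < pop_size:
            mother, father = rng.choices(population, weights=weights, k=2)
            point = rng.randrange(1, n_bits)  # random cross-over point
            for child in (mother[:point] + father[point:],
                          father[:point] + mother[point:]):
                # Flip each bit with probability p_mut.
                child = "".join(
                    ("1" if bit == "0" else "0") if rng.random() < p_mut else bit
                    for bit in child)
                new_generation.append(child)
        population = new_generation[:pop_size]  # population size stays constant
        best = max(population + [best], key=fitness)
    return best
```

With a toy fitness such as `lambda bits: bits.count("1")` the call `evolve(...)` returns a 32-character string; for alignment, the fitness would instead decode the string and score the intersection with the standard.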

Occasionally, the second-order polynomial will be unable to normalize the two
sequences. This is due to variations in the velocity of the experimental fragment pattern which
are greater than second-order. This is easily handled by using a higher order polynomial, for
example a third- or fourth-order polynomial, and a larger binary genotype to include the extra
coefficients; or by simply dividing the experimental fragment pattern into segments or windows
such that each segment's variations are at most second-order.

Once normalized fragment patterns have been obtained, they may be used in various
ways including base-calling and mutation detection. For purposes of determining the complete
sequence of all four bases in the sample polymer, this will generally involve the superposition
of the normalized fragment patterns for each of the four bases. This can be done by
designating a starting point or other "alignment point" in each fragment, and aligning those
points to position the aligned fragment patterns. Alternatively, the fragments can be aligned
using a reference peak as disclosed in US Patent Application Serial No. 08/452,719 filed May
30, 1995, which is incorporated herein by reference.

Fig. 8 illustrates the exercise of base-calling of aligned data, as obtained from a
Pharmacia A.L.F. Sequencer and processed using HELIOS (tm) software. Such base-calling
may be by any method known in the prior art, using aligned fragment patterns for each of the
four bases to provide a complete sequence.

Generally in base-calling, there are two steps, peak detection and sequence correlation.
The minimum value used in peak detection varies with each sequence and must be set on a
per-run basis. The well-known Fast Fourier Transform version of correlation is used to speed its
calculation.

Once the peak maxima are identified and located in time, potential ambiguities are
identified. Absent any potential ambiguities, a sequential record of the nucleotide represented
by each sequential peak concludes the base-calling exercise. In the diagnostic environment to
which the present application is applied, however, the nucleotide sequence record can be utilized
to detect specific mutations. This can be accomplished in a variety of ways, including amino
acid translation, identification of untranslated signal sequences such as start codons, stop
codons or splice site junctions. A preferred method involves determining correlations of the
normalized fragment patterns against a standard to obtain specific diagnostic information about
the presence of mutations.

To perform this correlation, a region around each identified peak in the standard
fragment pattern is correlated with the corresponding region in the normalized fragment
pattern. After normalization, the correlation will be low in locations where the two sequences
differ, i.e., where there is a nucleotide variation, because of the high degree of alignment which
normalization makes possible. However, correlation of a region extending approximately 20
data points on either side of a peak is desirable to compensate for small discrepancies which
may remain. The correlation process is then repeated for each peak in the normalized fragment
pattern. Instances of low correlation for any peak are indicative of a mutation.

The correlation of the peaks of the normalized fragment pattern with the standard
fragment pattern can be performed in several ways. One approach is to determine a standard
correlation, using the equation for correlation shown above. When two discrete functions are
correlated in this manner, a single number is obtained. This number ranges in value from zero
to some arbitrarily large number, the value of which depends upon the two functions being
correlated, but which is not predictable a priori. This can create a problem in setting threshold
levels defining high versus low correlation. It is therefore preferable to use a measure of
correlation which has defined limits to the range of possible values.

One such measure of correlation is called the "coefficient of correlation" which can be
calculated using the formula

$$r = \frac{1}{n} \sum_{i=1}^{n} \frac{(f_i - \bar{f})(g_i - \bar{g})}{f_{std}\, g_{std}}$$

where f_std is the standard deviation of function f, g_std is the standard deviation of function
g, and the overbars denote mean values. In this case, the output is normalized to a value of
between -1 and 1, inclusive. A value of 1 indicates total correlation, and a value of -1 indicates
complete non-correlation. Using this method, a gradient of correlation is supplied, and values
which fall below a pre-defined threshold, e.g., 0.8, could be flagged as suspect.
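
As an illustration, the coefficient of correlation can be evaluated over a region of about 20 data points on either side of each peak, as in the following Python sketch. It uses the usual Pearson form with population standard deviations, and it flags peaks whose coefficient falls below the threshold, on the reading that low correlation marks a possible mutation; the function names and the default threshold of 0.8 are illustrative assumptions.

```python
import math


def correlation_coefficient(f, g):
    """Pearson coefficient of two equal-length sequences, in the range [-1, 1]."""
    n = len(f)
    mean_f = sum(f) / n
    mean_g = sum(g) / n
    f_std = math.sqrt(sum((v - mean_f) ** 2 for v in f) / n)
    g_std = math.sqrt(sum((v - mean_g) ** 2 for v in g) / n)
    if f_std == 0 or g_std == 0:
        raise ValueError("coefficient is undefined for a constant sequence")
    covariance = sum((a - mean_f) * (b - mean_g) for a, b in zip(f, g)) / n
    return covariance / (f_std * g_std)


def low_correlation_peaks(standard, normalized, peak_positions,
                          half_width=20, threshold=0.8):
    """Positions of standard peaks whose surrounding region correlates poorly."""
    flagged = []
    for p in peak_positions:
        lo = max(0, p - half_width)
        hi = min(len(standard), p + half_width + 1)
        if correlation_coefficient(standard[lo:hi], normalized[lo:hi]) < threshold:
            flagged.append(p)
    return flagged
```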

An alternative to determining the coefficient of correlation is to use the function

[equation illegible in source; two summations over i = 1 to n]

where x and y are the data points of the two fragment patterns being compared. This equation
provides only a rough correlation value, but relies on little computation to do so. Given the
large values frequently encountered, the error in this equation may be acceptable.

To return to the advantages of the invention, it is also noted that use of the standard
fragment pattern allows the resolution of nucleotide sequence where ambiguities occur, such
as compressions and loss of single nucleotide resolution. Thus, the present invention permits
automated analysis of many of the ambiguities which are simply rejected as uninterpretable
by known sequencing techniques and equipment.

It is found experimentally that it is not always necessary to obtain precise signal
maxima for each nucleotide in order to determine the presence or absence of mutation in the
patient sample. Localized areas which fail to clearly resolve into peaks under high speed
electrophoresis can still carry enough wave form information to allow accurate interpretation
of the presence or absence of mutation in the patient sample when the sample fragment pattern
has been normalized in accordance with the invention.

"Compressions" are localized areas of fragment pattern anomalies wherein a series of
bands in aligned data are not separated to the same degree as other nearby bands.
Compressions are thought to result from short hairpin hybridizations at one end of the nucleic acid
molecule which tend to cause a molecule to travel faster through an electrophoresis gel than
would be expected on the basis of size. The resulting appearance in the fragment pattern is
illustrated in Fig. 9. These compressions may consist of overlapping peaks within one lane that
give one large peak, or they may be peaks from different lanes that overlap when combined
together in the alignment process.

Normally, a base-calling method is not able to determine the number or order of the
bases in the compression, because it is unable to distinguish the correct ordering of bands.
Examination reveals a peak (Peak A) which is clearly wider than a singleton peak (Peaks B and
C) but is otherwise indefinable. The method and system of the instant invention, however,
assigns the correct order and the correct number of nucleotides based on what is known about
the standard fragment pattern.

As stated hereinabove, the standard fragment pattern includes regions of compressions
that are typical of a given nucleotide sequence. A compression can be characterized by the
following features (Fig. 9):

Peak Height (Ph)
Peak Width at half Ph (Pw)
Peak Area (Pa, not shown)
Centering of Ph on Pw (Cnt, not shown)

A compression is characterized in the trial runs by these features, and the ratios
between the features. An average and standard deviation is calculated for each ratio. The
more precise and controlled the trial runs have been, the lower the standard deviation will be.
The inclusiveness of the standard deviation must be broad enough to encompass the degree of
accuracy sought in base-calling. A standard deviation which includes only 90% of samples
will permit miscalling in 10% of samples, a number which may or may not be too high to be
usefully employed. Once ascertained, the compression statistics are recorded in association
with the ambiguous peak. These statistics are associated with each compression and herein
called a "standard compression."

Each standard compression can be assigned a nucleotide base sequence upon careful
investigation. Researchers resolve compressions by numerous techniques, which though more
cumbersome or less useful, serve to reveal the actual underlying nucleotide sequence. These
techniques include: sequencing from primers nearer to the compression, sequencing the
opposite strand of DNA, electrophoresis in more highly denaturing conditions, etc. Once the
actual base sequence is determined, it can be assigned as a group to the compression, thus
relieving the researcher from further time consuming exercises to resolve it.

Regions of the normalized fragment patterns which do not show discrete peaks for
base-calling are tested for the existence of known compressions. If no compression is known
for the region, the area is flagged for the human operator to examine as a possible new
mutation.

On the other hand, if the locus of the peak indicates that it lies in a region of known
compression, the ratios of the peak are determined as above. If the peak falls within the
standard deviation of all the ratios determined from the trial runs, it is then assigned the
sequence of the standard compression. Figs. 9A and 9B identify the actual nucleotides assigned
to a standard compression.





Where, however, the ratios determined for the compression fall outside of the standard
deviation, there lies the possibility of mutation. In this case, the ratios of the compression are
compared to all known and previously observed mutations in the standard compression. If the
compression falls within any of the previously identified mutations in the region, it may be
identified as corresponding to such a mutation. If the ratios fall outside of any known
standard, the area is flagged for examination by the human operator as an example of a possible
new and hitherto unobserved mutation.
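
The decision procedure for a peak in a region of known compression can be sketched in Python as follows. The dictionary keys mirror the feature abbreviations of Fig. 9; the choice of all pairwise ratios, the one-standard-deviation tolerance (`n_sd`), and the function names are assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev

FEATURES = ("Ph", "Pw", "Pa", "Cnt")


def feature_ratios(peak):
    """All pairwise ratios between the features of one measured peak."""
    return {(a, b): peak[a] / peak[b] for a, b in combinations(FEATURES, 2)}


def build_standard(trial_peaks):
    """Mean and standard deviation of every ratio over the trial runs."""
    per_trial = [feature_ratios(p) for p in trial_peaks]
    return {key: (mean([r[key] for r in per_trial]),
                  pstdev([r[key] for r in per_trial]))
            for key in per_trial[0]}


def matches(peak, standard, n_sd=1.0):
    """True if every ratio lies within n_sd standard deviations of its mean."""
    ratios = feature_ratios(peak)
    return all(abs(ratios[key] - mu) <= n_sd * sd
               for key, (mu, sd) in standard.items())


def classify_compression(peak, standard, known_mutations):
    """Return 'standard', the name of a matching known mutation, or a flag."""
    if matches(peak, standard):
        return "standard"
    for name, mutation_standard in known_mutations.items():
        if matches(peak, mutation_standard):
            return name
    return "flag for operator"
```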

A further application of the present invention is for base-calling beyond the limits of
single nucleotide resolution. In this case, the standard fragment pattern will define a region
where single nucleotide resolution is not observed. In cases of poor sequencing conditions and
a weakly resolving apparatus, resolution may fail around 200 nts. In excellent conditions,
some apparatus are known to produce read-lengths of over 700 nts. In either case, there is a
point at which single nucleotide resolution is lost and base-calling cannot be performed
accurately.

The instant invention relies on normalization using a standard fragment pattern to
resolve the ambiguous wave forms beyond the limit of single nucleotide resolution. The
method is essentially the same as a series of compression analyses as described hereinabove.
The wave forms beyond the limit of resolution in the standard fragment pattern are measured
and ratios between all features are calculated, to create an extended series of standard peaks.

Normalized fragment patterns, prepared as described hereinabove, may be sequentially
analyzed for consistency with the expected ratios of each peak-like feature. Any wave form
which does not fall within the parameters of the standard peaks is classified as anomalous and
flagged for further investigation.

As noted above, in some applications of the invention it is not necessary to perform
base-calling for all four nucleotide-specific channels in order to detect mutations. According
to the instant invention, it is possible to compare any base specific experimental fragment
pattern, for example the T lane of the patient sample, to the base specific standard fragment
pattern for that T lane. The features of the standard fragment pattern can be used to identify
differences within the test lane of the sample and thus provide information about the sample.

This aspect of the invention follows the normalization step described hereinabove. The
degree of correlation of a window at the preferred normalization is plotted against the shifted
origin data point of the window, effectively describing a cross-correlogram. Fig. 10 shows the
maximum and minimum correlation values obtained across the entire length of the standard
fragment pattern, as determined from a plurality of trial runs. A standard deviation can be
determined after a sufficient number of trial runs. Data from a test sample is also plotted. As
illustrated, one window is found to deviate substantially from its expected degree of
correlation. The failure to correlate as expected suggests that the window contains a mutation
or other difference from the standard. The system of the invention would cause such a window
to be flagged for closer examination by the human operator. Alternatively, the window could
be reported directly to the patient file for use in diagnosis. In a further alternative, the window
would not be reported to the human operator until base-calling had further confirmed that
there was a mutation present in the area represented by the window.
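
The cross-correlogram comparison can be sketched in Python as a check of each window's correlation against the minimum and maximum values seen in the trial runs. The per-window envelope, the optional `margin`, and the function names are illustrative assumptions rather than the procedure used to produce Fig. 10.

```python
def correlation_envelope(trial_runs):
    """Per-window (minimum, maximum) correlation over a list of trial runs.

    Each trial run is a list holding one correlation value per window.
    """
    return [(min(column), max(column)) for column in zip(*trial_runs)]


def deviating_windows(sample, envelope, margin=0.0):
    """Indices of windows whose correlation lies outside the trial envelope."""
    return [index
            for index, (value, (low, high)) in enumerate(zip(sample, envelope))
            if value < low - margin or value > high + margin]
```

Windows returned by `deviating_windows` would be the ones flagged for the operator or held until base-calling confirms a mutation.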

In general the cross correlogram mutation detection is a method of "single lane base-
calling" wherein the signal from a single nucleotide run is used to identify the presence or
absence of differences from the standard fragment pattern. A useful embodiment of this aspect
of the invention is for identification of infectious diseases in patient samples. Many groups of
diagnostically-significant bacteria, viruses, fungi and the like all contain regions of DNA which
are unique to an individual species, but which are nevertheless amplifiable using a single set of
amplification primers due to commonality of genetic code within related species. Diagnostic
tests for such organisms may not quickly distinguish between species within such groups.
Using the method of the invention, however, it is possible to quickly classify a sample as
belonging to one species within a group on the basis of the standard fragment patterns for one
selected type of nucleotide. This eliminates the need to base-call four lanes of nucleotides and
effectively allows a DNA sequencing apparatus to run four times as many samples in the same
time period as before.

Thus, in accordance with the present invention, there is provided a method for
classifying a sample of a nucleic acid as a particular species within a group of
commonly-amplifiable nucleic acid polymers. The method utilizes at least one sample fragment pattern
representing the positions of a selected type of nucleic acid base within the sample nucleic acid
polymer. For each commonly-amplifiable species within the group, a set of one or more
normalization coefficients is determined for the sample fragment pattern. These sets of
normalization coefficients are then applied to the sample fragment pattern to obtain a plurality
of trial fragment patterns, which are correlated with the corresponding standard fragment
patterns. The sample is classified as belonging to the species for which the trial fragment
pattern has the highest correlation with its corresponding standard fragment pattern, provided
that the correlation is over a pre-defined threshold.

This aspect of the invention is useful in identifying which allele of a group of alleles is
present in a gene. The method is also useful in identifying individual species from among a
group of genetic variants of a disease-causing microorganism, and in particular genetic variants
of human immunodeficiency virus.
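
The classification step can be illustrated with the following Python sketch, in which the only normalization coefficient is a wrap-around shift and the correlation measure is the Pearson coefficient. Both simplifications, the threshold value, and the function names are assumptions for illustration; a fuller implementation would apply whatever normalization coefficients were determined for each species.

```python
import math


def pearson(f, g):
    """Pearson coefficient; returns 0.0 when either sequence is constant."""
    n = len(f)
    mean_f, mean_g = sum(f) / n, sum(g) / n
    var_f = sum((v - mean_f) ** 2 for v in f)
    var_g = sum((v - mean_g) ** 2 for v in g)
    if var_f == 0 or var_g == 0:
        return 0.0
    cov = sum((a - mean_f) * (b - mean_g) for a, b in zip(f, g))
    return cov / math.sqrt(var_f * var_g)


def best_correlation(sample, standard):
    """Highest coefficient over all wrap-around shifts of the sample."""
    n = len(sample)
    # sample[n - s:] + sample[:n - s] rotates the sample right by s points.
    return max(pearson(sample[n - s:] + sample[:n - s], standard)
               for s in range(n))


def classify_species(sample, standards, threshold=0.8):
    """Name of the best-matching standard, or None if below the threshold."""
    scores = {name: best_correlation(sample, pattern)
              for name, pattern in standards.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None
```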

A further variation of the invention which may be useful in certain conditions is the
reduction of the experimental and standard fragment patterns into square wave data. Square
wave data is useful when the signal obtained is highly reproducible from run to run. The main
advantage of a square wave data format is that it includes a maximum of information content
and a minimum of noise.

The standard fragment pattern may be reduced to a square wave by a number of means.
In one method, the transition from zero to one occurs at the inflection point on each slope of
a peak. The inflection points are found by using the zero crossings of a function that is the
convolution of the data function with a function that is the second derivative of a Gaussian
pulse that is about one half the width of a single base pair pulse in the original data sequence.
This derives inflection points with relatively little addition of noise due to the differentiation
process. Any data point value greater than the inflection point on that slope of the peak is
assigned 1. Any value below the inflection point is assigned 0.
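
A minimal Python sketch of this reduction is given below. It convolves the data with a sampled second derivative of a Gaussian and assigns 1 wherever the response is negative, i.e. between the two inflection points of a peak. The kernel radius of four standard deviations, the zero-sum correction of the kernel, the edge clamping, the small tolerance, and the parameter name `sigma` (in data points) are all assumptions made for illustration.

```python
import math


def gaussian_second_derivative(sigma):
    """Sampled second derivative of a Gaussian, adjusted to sum to zero."""
    radius = int(math.ceil(4 * sigma))
    kernel = [((t * t) / sigma ** 2 - 1.0) / sigma ** 2
              * math.exp(-(t * t) / (2.0 * sigma ** 2))
              for t in range(-radius, radius + 1)]
    offset = sum(kernel) / len(kernel)
    return [k - offset for k in kernel]  # a flat baseline now gives no response


def to_square_wave(data, sigma):
    """1 between the inflection points of each peak, 0 elsewhere."""
    kernel = gaussian_second_derivative(sigma)
    radius = len(kernel) // 2
    n = len(data)
    response = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp at the edges
            acc += weight * data[j]
        response.append(acc)
    # Ignore responses that are negative only through rounding error.
    tolerance = 1e-6 * max(abs(r) for r in response)
    return [1 if r < -tolerance else 0 for r in response]
```

Because the smoothing widens each peak slightly, the transitions fall a little outside the inflection points of the raw data; a narrower `sigma` reduces this at the cost of more noise.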

The peaks on the square wave are identified and assigned nucleotide sequences. Peaks
may be assigned one or more nucleotides as determined by the human operator on the basis
of the standard fragment pattern. Peaks are then given identifying characteristics such as a
sequential peak number, a standard peak width, a standard gap width on either side of the peak
and standard deviations for these characteristics.

When a sample fragment pattern is obtained, it is reduced to a square wave format,
again on the basis of the inflection point data as described above. Peak numbers are assigned.
The sample square wave may then be used in different ways to identify mutations. In one
method, it may be used to align the four different nucleotide data streams as in the method of
the invention described hereinabove. Alternatively, analysis may be purely statistical. The
peak width and gap width of the sample can be directly compared to the standard square wave.
If the sample characteristics fall within the standard deviation of the standard, taking into
account permissible elasticity of the peaks, then the sample is concluded to be the same as the
standard. If the peaks of the sample cannot be fit within the terms of the standard, then the
presence of a mutation is concluded and reported.

The present invention is advantageously implemented using any multipurpose computer,
including those generally referred to as personal computers and mini-computers, programmed
to determine normalization coefficients by comparison of an experimental and a standard
fragment pattern. As shown schematically in Fig. 11, such a computer will include at least one
central processor 110, for example an Intel 80386, 80486 or Pentium® processor or Motorola
68030, Motorola 68040 or Power PC 601; a storage device, such as a hard disk 111, for
storing standard fragment patterns; and means for receiving raw or clean experimental fragment
patterns such as wire 112 shown connected to the output of an electrophoresis apparatus 113.
The processor 110 is programmed to perform the comparison of the experimental
fragment pattern and the standard fragment pattern and to determine normalization coefficients
based on the comparison. This programming may be permanent, as in the case where the
processor is a dedicated EEPROM, or it may be transient, in which case the programming
instructions are loaded from the storage device or from a floppy diskette or other transportable
media.

The normalization coefficients may be output from the computer in print form using
printer 114; on a video display 115; or via a communications link 116 to another processor
117. Alternatively or additionally, the normalization coefficients may be utilized by the
processor 110 to normalize the experimental fragment pattern for use in base-calling or other
diagnostic evaluation. Thus, the apparatus may also include programming for applying the
normalization coefficients to the experimental fragment pattern to obtain a normalized
fragment pattern, and for aligning the normalized fragment patterns and evaluating the nucleic
acid sequence of the sample therefrom.






CLAIMS 



1. A method for determining the sequence of bases in a sample nucleic acid
polymer putatively having a known sequence comprising the steps of:
(a) obtaining at least one raw fragment pattern representing the positions
of one selected type of nucleic acid base within the sample nucleic acid polymer as a function
of migration time or distance;
(b) conditioning the raw fragment pattern to obtain a clean fragment
pattern;
(c) determining one or more normalization coefficients for the clean
fragment pattern, said normalization coefficients being selected to provide a high degree of
overlap between a normalized fragment pattern obtained by applying the normalization
coefficients to the clean fragment pattern and a standard fragment pattern representing the
positions of the selected type of nucleic acid base in a standard nucleic acid polymer actually
having the known sequence;
(d) applying the normalization coefficients to the clean fragment pattern to
obtain the normalized fragment pattern; and
(e) evaluating the normalized fragment pattern to determine positions of
at least the selected type of base within the sample nucleic acid polymer.



2. A method according to claim 1, wherein the normalization coefficients
are coefficients of a second- or higher-order polynomial.

3. A method according to claim 1 or 2, wherein the normalization
coefficients are determined using Genetic Algorithms.
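
The combination recited in claims 2 and 3 (polynomial normalization coefficients found by a genetic algorithm) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the applicants' implementation: the real-valued genotype, the Pearson correlation used as the overlap score, the mutation scales, the population settings and the helper names (`make_trace`, `warp`, `mutate`, `evolve`) are all hypothetical choices made for the example.

```python
import math
import random


def make_trace(peaks, n=200, width=3.0):
    """Synthetic fragment pattern: one Gaussian band per peak position."""
    return [sum(math.exp(-((t - p) / width) ** 2) for p in peaks) for t in range(n)]


def correlation(a, b):
    """Pearson correlation, used here as the 'degree of overlap' score (assumption)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # a flat trace carries no overlap information
    return cov / math.sqrt(va * vb)


def warp(pattern, coeffs):
    """Resample the pattern at t' = c0 + c1*t + c2*t^2 by linear interpolation."""
    c0, c1, c2 = coeffs
    n = len(pattern)
    out = []
    for t in range(n):
        x = c0 + c1 * t + c2 * t * t
        if x < 0 or x > n - 1:
            out.append(0.0)  # outside the recorded trace
            continue
        i = int(x)
        frac = x - i
        out.append((1 - frac) * pattern[i] + frac * pattern[min(i + 1, n - 1)])
    return out


def mutate(coeffs, rng):
    """Perturb each coefficient with noise scaled to its order (assumed scales)."""
    scales = (1.0, 0.01, 0.00001)
    return tuple(c + rng.gauss(0.0, s) for c, s in zip(coeffs, scales))


def evolve(sample, standard, generations=40, pop_size=30, seed=0):
    """Search for second-order coefficients that maximize the overlap score."""
    rng = random.Random(seed)

    def fitness(coeffs):
        return correlation(warp(sample, coeffs), standard)

    identity = (0.0, 1.0, 0.0)  # leaves the pattern unchanged
    population = [identity] + [mutate(identity, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, 2)  # crossover point between coefficients
            children.append(mutate(a[:cut] + b[cut:], rng))
        population = parents + children
    return max(population, key=fitness)


standard = make_trace([30, 70, 110, 150])
sample = make_trace([35, 75, 115, 155])  # same bands, arriving 5 points late
best = evolve(sample, standard)
print(best, correlation(warp(sample, best), standard))
```

Because the unmodified coefficients are seeded into the first generation and the fitter half always survives, the returned coefficients never score worse than leaving the sample pattern as it is.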

4. A method according to any of claims 1 - 3, wherein the clean data
fragment is divided into a plurality of windows, and wherein separate normalization coefficients
are determined for each window.






5. A method according to claim 4, wherein each window contains 100-10,000
data points.

6. A method according to any of claims 1 - 5, wherein the clean data
fragment is obtained using a feed-back loop to obtain a preferred band-pass filter.

7. A method according to any of claims 1 - 6, wherein the base-calling step
resolves a non-singleton peak in the normalized fragment pattern by statistical comparison of
measurements of the non-singleton peak with standard values associated with a corresponding
peak in the standard fragment pattern.

8. A method according to any of claims 1 - 6, wherein the clean fragment
pattern and standard fragment pattern are reduced to square waves.
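
The square-wave reduction of claim 8 can be pictured with a short Python sketch. The fixed threshold and the agreement score are assumptions made for illustration, as are the function names `to_square_wave` and `agreement`; the claim itself does not state how the reduction is performed.

```python
def to_square_wave(pattern, threshold):
    """Return 1 where the signal exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in pattern]


def agreement(a, b):
    """Fraction of data points at which two square waves have the same level."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


clean = [0.1, 0.2, 0.9, 1.0, 0.3, 0.1, 0.8, 0.2]      # hypothetical clean pattern
standard = [0.0, 0.1, 1.0, 0.9, 0.2, 0.0, 0.1, 0.1]   # hypothetical standard pattern

clean_sq = to_square_wave(clean, 0.5)        # [0, 0, 1, 1, 0, 0, 1, 0]
standard_sq = to_square_wave(standard, 0.5)  # [0, 0, 1, 1, 0, 0, 0, 0]
print(agreement(clean_sq, standard_sq))      # 0.875
```

Reducing both patterns to two levels makes the comparison depend on band position and width only, not on band intensity.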

9. A method according to any of claims 1 - 8, wherein four raw fragment
patterns are obtained, one for each nucleic acid base, and four normalized patterns are
produced, further comprising the step of aligning the four normalized fragment patterns and
then evaluating the aligned normalized fragment patterns by base-calling to determine the
positions of all base types in the sample nucleic acid polymer.

10. A method according to any of claims 1 - 8, wherein only one raw
fragment pattern is obtained, and the positions of only the selected type of base within the
sample nucleic acid polymer are determined.

11. A method according to claim 10, wherein the position of the selected
type of base within the sample nucleic acid polymer are determined by comparing the
normalized fragment pattern to the standard fragment pattern and noting the presence or
absence of each peak.



12. A method for evaluating the sequence of a nucleic acid polymer
putatively having a known sequence wherein oligonucleotide fragments reflecting the position
of nucleic acid bases within the nucleic acid polymer are separated in space or time and then
detected as a fragment pattern which is evaluated to determine the sequence of the nucleic acid
polymer, characterized by the steps of:
(a) determining one or more normalization coefficients for the fragment
pattern, said normalization coefficients being selected to provide a high degree of overlap
between a normalized fragment pattern obtained by applying the normalization coefficients to
the fragment pattern and a standard fragment pattern representing the positions of the selected
nucleic acid base in a standard nucleic acid polymer actually having the known sequence; and
(b) applying the normalization coefficients to the fragment pattern prior to
evaluation of the fragment pattern to determine the sequence of the nucleic acid polymer.

13. A method according to claim 12, wherein the normalization coefficients
are coefficients of a second- or higher-order polynomial.

14. A method according to any of claims 12 or 13, wherein the
normalization coefficients are determined using Genetic Algorithms.

15. A method according to any of claims 12 - 14, wherein the fragment
pattern is divided into a plurality of windows, and wherein separate normalization coefficients
are determined for each window.

16. A method for detecting mutations in a sample nucleic acid polymer
having a putatively normal genetic sequence comprising the steps of:
(a) obtaining at least one sample fragment pattern representing the positions
of a selected nucleic acid base within the sample nucleic acid polymer;
(b) determining one or more normalization coefficients for the sample
fragment pattern, said normalization coefficients being selected to provide a high degree of
overlap between a normalized fragment pattern obtained by applying the normalization
coefficients to the sample fragment pattern and a standard fragment pattern representing the
positions of the selected nucleic acid base in a standard nucleic acid polymer actually having
the known sequence;
(c) applying the normalization coefficients to the sample fragment pattern
to obtain the normalized fragment pattern;
(d) dividing the normalized fragment pattern into a plurality of windows;
and
(e) determining the correlation between each window and the standard
fragment pattern; wherein a difference between the correlation for any window and
pre-determined standard correlation values reflects the presence of a mutation within that window.

17. A method according to claim 16, wherein the normalization coefficients
are coefficients of a second- or higher-order polynomial.

18. A method according to claim 16 or 17, wherein the normalization
coefficients are determined using Genetic Algorithms.
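
Steps (d) and (e) of claim 16 can be sketched in Python as follows. The sketch assumes the sample pattern has already been normalized, uses Pearson correlation as the correlation measure, and takes the pre-determined standard correlation values as a lower and upper limit per window; the synthetic traces, the limit values and the function names are hypothetical.

```python
import math


def make_trace(peaks, n=60, width=2.0):
    """Synthetic fragment pattern: one Gaussian band per peak position."""
    return [sum(math.exp(-((t - p) / width) ** 2) for p in peaks) for t in range(n)]


def correlation(a, b):
    """Pearson correlation, clamped to [-1, 1] to absorb rounding error."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return max(-1.0, min(1.0, cov / math.sqrt(va * vb)))


def window_correlations(normalized, standard, window):
    """Correlate each non-overlapping window of the two patterns."""
    starts = range(0, len(standard) - window + 1, window)
    return [correlation(normalized[s:s + window], standard[s:s + window]) for s in starts]


def flag_windows(correlations, limits):
    """Indices of windows whose correlation falls outside its (low, high) limits."""
    return [i for i, (r, (low, high)) in enumerate(zip(correlations, limits))
            if r < low or r > high]


standard = make_trace([5, 15, 25, 35, 45, 55])
sample = make_trace([5, 15, 35, 45, 55])   # the band at position 25 is missing
limits = [(0.95, 1.0)] * 3                 # hypothetical limits from trial runs
suspect = flag_windows(window_correlations(sample, standard, 20), limits)
print(suspect)
```

In this synthetic case only the middle window, which contains the missing band, falls outside its limits.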

19. An apparatus for normalizing an experimental nucleic acid fragment
pattern putatively representing a known nucleic acid sequence comprising:
(a) a computer processor;
(b) a storage device having stored thereon a standard fragment pattern for
a nucleic acid polymer actually having the known sequence;
(c) means for receiving the experimental nucleic acid fragment pattern; and
(d) means for causing the computer processor to determine one or more
normalization coefficients for the experimental fragment pattern, said normalization coefficients
being selected to provide a high degree of overlap between a normalized fragment pattern
obtained by applying the normalization coefficients to the experimental fragment pattern and
the standard fragment pattern.

20. An apparatus according to claim 19, wherein the means for causing the
computer processor to determine one or more normalization coefficients is a program stored
on the storage device.




21. An apparatus according to claim 19 or 20, wherein the normalization
coefficients are coefficients of a second- or higher-order polynomial.

22. An apparatus according to any of claims 19 - 21, wherein the
normalization coefficients are determined using Genetic Algorithms.

23. An apparatus according to any of claims 19 - 22, further comprising
means for applying the normalization coefficients to the experimental fragment pattern to
obtain a normalized fragment pattern.

24. An apparatus according to claim 23, further comprising means for
aligning the normalized fragment patterns and evaluating the nucleic acid sequence of the
experimental fragment pattern therefrom.

25. An apparatus according to any of claims 19 - 24, further
comprising an electrophoresis apparatus, operatively coupled to the means for receiving an
experimental fragment pattern.

26. An apparatus according to any of claims 19 - 25, wherein the means for
causing the computer processor to determine one or more normalization coefficients is a
program stored on the storage device.

27. A method for classifying a sample of a nucleic acid as a particular
species within a group of commonly-amplifiable nucleic acid polymers, comprising the steps
of:
(a) obtaining at least one sample fragment pattern representing the positions
of a selected nucleic acid base within the sample nucleic acid polymer;
(b) for each commonly-amplifiable species within the group, determining
a set of one or more normalization coefficients for the sample fragment pattern, said
normalization coefficients being selected to provide a high degree of overlap between a
normalized fragment pattern obtained by applying the normalization coefficients to the sample
fragment pattern and a standard fragment pattern representing the positions of the selected
nucleic acid base within a standard nucleic acid polymer actually belonging to one of the
commonly-amplifiable species;
(c) applying the sets of normalization coefficients to the sample fragment
pattern to obtain a plurality of trial fragment patterns; and
(d) correlating the trial fragment patterns with the corresponding standard
fragment patterns, wherein the sample is classified as belonging to the species for which the
trial fragment pattern has the highest correlation with its corresponding standard fragment
pattern, provided that the correlation is over a pre-defined threshold.

28. A method according to claim 27, wherein the group of commonly-amplifiable
nucleic acid polymers comprises a plurality of alleles of a single gene.

29. A method according to claim 27, wherein the group of commonly-amplifiable
nucleic acid polymers comprises a plurality of genetic variants of a disease-causing
microorganism.

30. A method according to claim 29, wherein the disease-causing
microorganism is human immunodeficiency virus.
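
The decision rule in step (d) of claim 27 can be sketched in Python. The sketch assumes the trial fragment patterns have already been produced by the per-species normalization of steps (b) and (c); for brevity the example reuses the unnormalized sample as every trial pattern. The variant names, the threshold of 0.8 and the function names are hypothetical.

```python
import math


def make_trace(peaks, n=60, width=2.0):
    """Synthetic fragment pattern: one Gaussian band per peak position."""
    return [sum(math.exp(-((t - p) / width) ** 2) for p in peaks) for t in range(n)]


def correlation(a, b):
    """Pearson correlation, clamped to [-1, 1] to absorb rounding error."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return max(-1.0, min(1.0, cov / math.sqrt(va * vb)))


def classify(trial_patterns, standards, threshold):
    """Pick the species whose standard correlates best with its trial pattern.

    Returns (species or None, scores); None means no correlation is over the threshold.
    """
    scores = {name: correlation(trial_patterns[name], standards[name]) for name in standards}
    best = max(scores, key=scores.get)
    return (best if scores[best] > threshold else None), scores


standards = {
    "variant_1": make_trace([10, 30, 50]),
    "variant_2": make_trace([10, 25, 50]),
}

sample = make_trace([10, 25, 50])
trials = {name: sample for name in standards}    # normalization skipped in this sketch
label, scores = classify(trials, standards, 0.8)
print(label)

unrelated = make_trace([5, 18, 40])
no_label, _ = classify({name: unrelated for name in standards}, standards, 0.8)
print(no_label)
```

The threshold is what keeps a sample that matches none of the standards from being assigned to whichever one happens to correlate least badly.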



[Drawings, sheets 1/8 to 8/8, substitute sheets (Rule 26). Only the following labels are recoverable from the drawings:]

[Sheet 1/8: plot panel titled "AFTER NOISE FILTERING".]

[Following sheet: three plot axes labeled "CORRELATION".]

[FIG. 4: horizontal axis "SAMPLE DATA POINT" (0 to 6000); vertical axis ticks 0 to 60; annotation "OFFSET".]

[FIG. 6: bit-string "PARENTS" recombined at a "CROSSOVER POINT" to give "OFFSPRING".]

[FIG. 7: "GENOTYPE" bit string.]

[FIG. 9A: "PEAK A"; NUCLEOTIDE: A G G G T.]

[FIG. 9B: "PEAK C"; NUCLEOTIDE: A G G G G.]

[FIG. 10: correlation plotted against "DATA POINT NUMBER OF SHIFTED ORIGIN" (1000 to 5000). Legend: + = upper and lower limits of correlation value obtained from trial runs; o = correlation values obtained from sample run.]

[FIG. 11: apparatus; reference numerals 110, 116 and 117 legible.]



INTERNATIONAL SEARCH REPORT

International Application No. PCT/US 96/11130

A. CLASSIFICATION OF SUBJECT MATTER
IPC 6 G01N27/447

According to International Patent Classification (IPC) or to both national classification and IPC

B. FIELDS SEARCHED
Minimum documentation searched (classification system followed by classification symbols)
IPC 6 G01N

Documentation searched other than minimum documentation to the extent that such documents are included in the fields searched

Electronic data base consulted during the international search (name of data base and, where practical, search terms used)

C. DOCUMENTS CONSIDERED TO BE RELEVANT

Category | Citation of document, with indication, where appropriate, of the relevant passages | Relevant to claim No.
A | DE,A,44 05 251 (OLYMPUS OPTICAL CO. LTD.) 25 August 1994; see abstract | 1
A | APPLIED SPECTROSCOPY, vol. 46, no. 1, January 1992, FREDERICK, MD, US, pages 136-141, XP000247314, L. B. KOUNTY: "AUTOMATED IMAGE ANALYSIS FOR DISTORTION COMPENSATION IN SEQUENCING GEL ELECTROPHORESIS"; see the whole document | 1
A | US,A,5 419 825 (H. FUJII) 30 May 1995; see abstract; figure 1 | 1

[X] Further documents are listed in the continuation of box C.
[X] Patent family members are listed in annex.

* Special categories of cited documents:
"A" document defining the general state of the art which is not considered to be of particular relevance
"E" earlier document but published on or after the international filing date
"L" document which may throw doubts on priority claim(s) or which is cited to establish the publication date of another citation or other special reason (as specified)
"O" document referring to an oral disclosure, use, exhibition or other means
"P" document published prior to the international filing date but later than the priority date claimed
"T" later document published after the international filing date or priority date and not in conflict with the application but cited to understand the principle or theory underlying the invention
"X" document of particular relevance; the claimed invention cannot be considered novel or cannot be considered to involve an inventive step when the document is taken alone
"Y" document of particular relevance; the claimed invention cannot be considered to involve an inventive step when the document is combined with one or more other such documents, such combination being obvious to a person skilled in the art
"&" document member of the same patent family

Date of the actual completion of the international search: 15 October 1996
Date of mailing of the international search report: 31.10.96

Name and mailing address of the ISA:
European Patent Office, P.B. 5818 Patentlaan 2
NL - 2280 HV Rijswijk
Tel. (+31-70) 340-2040, Tx. 31 651 epo nl,
Fax: (+31-70) 340-3016

Authorized officer: Duchatellier, M





INTERNATIONAL SEARCH REPORT

International Application No. PCT/US 96/11130

C. (Continuation) DOCUMENTS CONSIDERED TO BE RELEVANT

Category | Citation of document, with indication, where appropriate, of the relevant passages | Relevant to claim No.
  | GB,A,2 225 139 (UNITED KINGDOM ATOMIC ENERGY AUTHORITY) 23 May 1990; see abstract; figure 3 | 1
A | US,A,5 365 455 (C. TIBBETTS) 15 November 1994; cited in the application; see abstract; figure 1 | 1



INTERNATIONAL SEARCH REPORT

International Application No. PCT/US 96/11130

Patent document cited in search report | Publication date | Patent family member(s) | Publication date
DE-A-4405251 | 25-08-94 | JP-A-6241986 | 02-09-94
             |          | JP-A-6273329 | 30-09-94
US-A-5419825 | 30-05-95 | JP-A-5107227 | 27-04-93
             |          | JP-B-8027258 | 21-03-96
GB-A-2225139 | 23-05-90 | NONE         |
US-A-5365455 | 15-11-94 | US-A-5502773 | 26-03-96



